NSF PAR Search | NSF Public Access Repository

On Generalization Bounds for Neural Networks with Low Rank Layers

Pinto, Andrea; Rangamani, Akshay; Poggio, Tomaso (December 2024, Center for Brains, Minds and Machines (CBMM))

While previous optimization results have suggested that deep neural networks tend to favour low-rank weight matrices, the implications of this inductive bias on generalization bounds remain underexplored. In this paper, we apply a chain rule for Gaussian complexity (Maurer, 2016a) to analyze how low-rank layers in deep networks can prevent the accumulation of rank and dimensionality factors that typically multiply across layers. This approach yields generalization bounds for rank- and spectral-norm-constrained networks. We compare our results to prior generalization bounds for deep networks, highlighting how deep networks with low-rank layers can achieve better generalization than those with full-rank layers. Additionally, we discuss how this framework provides new perspectives on the generalization capabilities of deep networks exhibiting neural collapse.
Keywords: Gaussian complexity, Generalization bounds, Neural collapse, Low rank layers
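As a rough illustration of what a rank and spectral norm constraint on a single layer can look like, the sketch below truncates a weight matrix to a target rank with an SVD and clips its singular values. The function name, the clipping strategy, and the matrix sizes are assumptions made for this example; the abstract does not say how the constraints are enforced.

```python
import numpy as np

def constrain_rank_and_spectral_norm(W, rank, max_spectral_norm):
    """Return a matrix with rank <= `rank` and spectral norm <= `max_spectral_norm`.

    Hypothetical helper: truncated SVD plus singular-value clipping is one
    simple way to satisfy both constraints, not the paper's procedure.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # Keep the top `rank` singular values and cap each at the allowed norm.
    S_kept = np.minimum(S[:rank], max_spectral_norm)
    return (U[:, :rank] * S_kept) @ Vt[:rank]

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 8))  # illustrative layer size
W_low = constrain_rank_and_spectral_norm(W, rank=2, max_spectral_norm=1.0)
```

The spectral norm of the result is its largest singular value, which can be checked with `np.linalg.norm(W_low, 2)`.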
Full Text Available
The Janus effects of SGD vs GD: high noise and low rank

Xu, Mengjia; Galanti, Tomer; Rosasco, Lorenzo; Rangamani, Akshay; Pinto, Andrea; Poggio, Tomaso (February 2024, Center for Brains, Minds and Machines (CBMM))

It has long been clear that SGD with a small minibatch size yields, for neural networks, much higher asymptotic fluctuations in the weight-matrix updates than GD. It has also often been reported that SGD in deep ReLU networks empirically shows a low-rank bias in the weight matrices. A recent theoretical analysis derived a bound on the rank and linked it to the size of the SGD fluctuations [25]. In this paper, we provide an empirical and theoretical analysis of the convergence of SGD vs GD, first for deep ReLU networks and then for the case of linear regression, where sharper estimates can be obtained and which is of independent interest. In the linear case, we prove that the component $$W^\perp$$ of the matrix $$W$$ corresponding to the null space of the data matrix $$X$$ converges to zero for both SGD and GD, provided the regularization term is non-zero. Because of the larger number of updates required to go through all the training data, the convergence rate per epoch of these components is much faster for SGD than for GD. In practice, SGD has a much stronger bias than GD towards solutions for weight matrices $$W$$ with high fluctuations (even when the choice of mini batches is deterministic) and low rank, provided the initialization is from a random matrix. Thus SGD with non-zero regularization shows the coupled phenomenon of asymptotic noise and a low-rank bias, unlike GD.
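The linear-regression claim can be checked numerically with a small sketch. It assumes a squared loss with an L2 penalty, a data matrix `X` of shape (features, samples) with more features than samples, and reads $$W^\perp$$ as the part of `W` acting on directions orthogonal to all data points. Under these assumptions every gradient step shrinks that part by exactly `1 - lr * weight_decay`, so more steps per epoch means faster decay per epoch. All names and hyperparameters here are illustrative.

```python
import numpy as np

def perp_component_norms(batch_size, epochs=20, lr=0.01, weight_decay=1.0, seed=0):
    """Norm of W restricted to the orthogonal complement of the data span,
    before and after training (hypothetical setup, deterministic batches)."""
    rng = np.random.default_rng(seed)
    d, n, k = 30, 10, 3                          # features, samples, outputs
    X = rng.standard_normal((d, n))
    Y = rng.standard_normal((k, n))
    W = rng.standard_normal((k, d))              # random initialization
    P_perp = np.eye(d) - X @ np.linalg.pinv(X)   # projector orthogonal to the data
    initial = np.linalg.norm(W @ P_perp)
    for _ in range(epochs):
        for start in range(0, n, batch_size):    # fixed, non-random batch order
            Xb = X[:, start:start + batch_size]
            Yb = Y[:, start:start + batch_size]
            grad = (W @ Xb - Yb) @ Xb.T / Xb.shape[1] + weight_decay * W
            W = W - lr * grad
    return initial, np.linalg.norm(W @ P_perp)

init_gd, final_gd = perp_component_norms(batch_size=10)   # full batch: 1 step per epoch
init_sgd, final_sgd = perp_component_norms(batch_size=1)  # 10 steps per epoch
```

With the same initialization, `final_sgd` is smaller than `final_gd` after the same number of epochs, because the data-fit part of each gradient has no component orthogonal to the data and only the weight-decay term acts there.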
Full Text Available
Feature learning in deep classifiers through Intermediate Neural Collapse

Rangamani, Akshay; Lindegaard, Marius; Galanti, Tomer; Poggio, Tomaso (February 2023, Center for Brains, Minds and Machines (CBMM))

In this paper, we conduct an empirical study of the feature learning process in deep classifiers. Recent research has identified a training phenomenon called Neural Collapse (NC), in which the top-layer feature embeddings of samples from the same class tend to concentrate around their means, and the top layer’s weights align with those features. Our study aims to investigate if these properties extend to intermediate layers. We empirically study the evolution of the covariance and mean of representations across different layers and show that as we move deeper into a trained neural network, the within-class covariance decreases relative to the between-class covariance. Additionally, we find that in the top layers, where the between-class covariance is dominant, the subspace spanned by the class means aligns with the subspace spanned by the most significant singular vector components of the weight matrix in the corresponding layer. Finally, we discuss the relationship between NC and Associative Memories (Willshaw et al., 1969).
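One simple way to quantify "within-class covariance relative to between-class covariance" for a layer's features is the ratio of the traces of the two covariance matrices. The sketch below is an assumed, simplified metric; the function name is hypothetical and the paper may use a different normalization.

```python
import numpy as np

def within_between_trace_ratio(features, labels):
    """trace(within-class covariance) / trace(between-class covariance).

    `features` has shape (num_samples, dim); `labels` has shape (num_samples,).
    Values near zero indicate features concentrated around their class means.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    global_mean = features.mean(axis=0)
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        class_feats = features[labels == c]
        class_mean = class_feats.mean(axis=0)
        # Scatter of samples around their own class mean.
        within += np.sum((class_feats - class_mean) ** 2)
        # Scatter of the class mean around the global mean, weighted by class size.
        between += class_feats.shape[0] * np.sum((class_mean - global_mean) ** 2)
    return within / between
```

Applying this function to the activations of each layer in turn gives one number per layer, which can then be compared across depth.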
Full Text Available
For interpolating kernel machines, minimizing the norm of the ERM solution maximizes stability

https://doi.org/10.1142/S0219530522400115

Rangamani, Akshay; Rosasco, Lorenzo; Poggio, Tomaso (January 2023, Analysis and Applications)

In this paper, we study kernel ridge-less regression, including the case of interpolating solutions. We prove that maximizing the leave-one-out ([Formula: see text]) stability minimizes the expected error. Further, we also prove that the minimum norm solution — to which gradient algorithms are known to converge — is the most stable solution. More precisely, we show that the minimum norm interpolating solution minimizes a bound on [Formula: see text] stability, which in turn is controlled by the smallest singular value, hence the condition number, of the empirical kernel matrix. These quantities can be characterized in the asymptotic regime where both the dimension ([Formula: see text]) and cardinality ([Formula: see text]) of the data go to infinity (with [Formula: see text] as [Formula: see text]). Our results suggest that the property of [Formula: see text] stability of the learning algorithm with respect to perturbations of the training set may provide a more general framework than the classical theory of Empirical Risk Minimization (ERM). While ERM was developed to deal with the classical regime in which the architecture of the learning network is fixed and [Formula: see text], the modern regime focuses on interpolating regressors and overparameterized models, when both [Formula: see text] and [Formula: see text] go to infinity. Since the stability framework is known to be equivalent to the classical theory in the classical regime, our results here suggest that it may be interesting to extend it beyond kernel regression to other overparameterized algorithms such as deep networks.
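The minimum norm interpolating solution and the conditioning of the empirical kernel matrix can be sketched as follows. The Gaussian kernel, its bandwidth, and the synthetic data are assumptions for illustration only; the abstract does not fix a kernel.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

rng = np.random.default_rng(0)
X_train = rng.standard_normal((8, 5))   # 8 points in 5 dimensions (illustrative)
y_train = rng.standard_normal(8)

K = gaussian_kernel(X_train, X_train)   # empirical kernel matrix
# The pseudoinverse gives the minimum norm solution of K c = y;
# when K is invertible it coincides with np.linalg.solve(K, y_train).
coeffs = np.linalg.pinv(K) @ y_train

singular_values = np.linalg.svd(K, compute_uv=False)
condition_number = singular_values.max() / singular_values.min()

def predict(X_new):
    """Evaluate the interpolant at new points (rows of X_new)."""
    return gaussian_kernel(X_new, X_train) @ coeffs
```

Here `predict(X_train)` reproduces `y_train` up to rounding, and a small value of `singular_values.min()` (a large `condition_number`) signals that the solution is sensitive to perturbations of the training set.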
Full Text Available
Dynamics in Deep Classifiers Trained with the Square Loss: Normalization, Low Rank, Neural Collapse, and Generalization Bounds

https://doi.org/10.34133/research.0024

Xu, Mengjia; Rangamani, Akshay; Liao, Qianli; Galanti, Tomer; Poggio, Tomaso (January 2023, Research)

We overview several properties, old and new, of training overparameterized deep networks under the square loss. We first consider a model of the dynamics of gradient flow under the square loss in deep homogeneous rectified linear unit networks. We study the convergence to a solution with the absolute minimum ρ, which is the product of the Frobenius norms of each layer weight matrix, when normalization by Lagrange multipliers is used together with weight decay under different forms of gradient descent. A main property of the minimizers that bounds their expected error for a specific network architecture is ρ. In particular, we derive novel norm-based bounds for convolutional layers that are orders of magnitude better than classical bounds for dense networks. Next, we prove that quasi-interpolating solutions obtained by stochastic gradient descent in the presence of weight decay have a bias toward low-rank weight matrices, which should improve generalization. The same analysis predicts the existence of an inherent stochastic gradient descent noise for deep networks. In both cases, we verify our predictions experimentally. We then predict neural collapse and its properties without any specific assumption, unlike other published proofs. Our analysis supports the idea that the advantage of deep networks relative to other classifiers is greater for problems that are appropriate for sparse deep architectures such as convolutional neural networks. The reason is that compositionally sparse target functions can be approximated well by “sparse” deep networks without incurring the curse of dimensionality.
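The quantity ρ described above, the product of the per-layer Frobenius norms, can be computed directly. The sketch below assumes a bias-free ReLU network (so that it is positively homogeneous in each layer) and uses hypothetical helper names; it also shows that the network output factors as ρ times the output of the network with unit-norm layers.

```python
import numpy as np

def relu_net(weights, x):
    """Bias-free ReLU network: ReLU after every layer except the last."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h

def rho_and_normalized(weights):
    """Return rho (product of Frobenius norms) and the unit-norm layers."""
    norms = [np.linalg.norm(W) for W in weights]   # Frobenius norm for 2-D arrays
    rho = float(np.prod(norms))
    normalized = [W / s for W, s in zip(weights, norms)]
    return rho, normalized

rng = np.random.default_rng(0)
weights = [rng.standard_normal(shape) for shape in [(4, 3), (5, 4), (2, 5)]]
x = rng.standard_normal(3)

rho, normalized = rho_and_normalized(weights)
out_full = relu_net(weights, x)
out_scaled = rho * relu_net(normalized, x)   # matches out_full up to rounding
```

This factorization relies only on ReLU being positively homogeneous; it would not hold as written for networks with bias terms.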
Full Text Available
Deep Classifiers trained with the Square Loss

Xu, Mengjia; Rangamani, Akshay; Banburski, Andrzej; Liao, Qianli; Galanti, Tomer; Poggio, Tomaso (October 2022, Center for Brains, Minds and Machines (CBMM))

We overview several properties, old and new, of training overparametrized deep networks under the square loss. We first consider a model of the dynamics of gradient flow under the square loss in deep homogeneous ReLU networks. We study the convergence to a solution with the absolute minimum $$\rho$$, which is the product of the Frobenius norms of each layer weight matrix, when normalization by Lagrange multipliers (LM) is used together with Weight Decay (WD) under different forms of gradient descent. A main property of the minimizers that bounds their expected error for a specific network architecture is $$\rho$$. In particular, we derive novel norm-based bounds for convolutional layers that are orders of magnitude better than classical bounds for dense networks. Next, we prove that quasi-interpolating solutions obtained by Stochastic Gradient Descent (SGD) in the presence of WD have a bias towards low-rank weight matrices, which, as we also explain, should improve generalization. The same analysis predicts the existence of an inherent SGD noise for deep networks. In both cases, we verify our predictions experimentally. We then predict Neural Collapse and its properties without any specific assumption, unlike other published proofs. Our analysis supports the idea that the advantage of deep networks relative to other classifiers is greater for the problems that are appropriate for sparse deep architectures such as CNNs. The deep reason is that compositionally sparse target functions can be approximated well by "sparse" deep networks without incurring the curse of dimensionality.
Full Text Available
Detection and Confirmation of Multiple Human Targets Using Pixel-Wise Code Aperture Measurements

https://doi.org/10.3390/jimaging6060040

Kwan, Chiman; Gribben, David; Rangamani, Akshay; Tran, Trac; Zhang, Jack; Etienne-Cummings, Ralph (June 2020, Journal of Imaging)

Compressive video measurements can save bandwidth and data storage. However, conventional approaches to target detection require the compressive measurements to be reconstructed before any detectors are applied. This is not only time-consuming but may also lose information in the reconstruction process. In this paper, we summarize the application to human targets of a recent approach to vehicle detection and classification that works directly in the compressive measurement domain. The raw videos were collected using a pixel-wise code exposure (PCE) camera, which condensed multiple frames into one frame. A combination of two deep learning-based algorithms (you only look once (YOLO) and residual network (ResNet)) was used for detection and confirmation. Optical and mid-wave infrared (MWIR) videos from a well-known database (SENSIAC) were used in our experiments. Extensive experiments demonstrated that the proposed framework was feasible for target detection up to 1500 m, but target confirmation needs more research.
Full Text Available
Target tracking and classification using compressive sensing camera for SWIR videos

https://doi.org/10.1007/s11760-019-01506-4

Kwan, Chiman; Chou, Bryan; Yang, Jonathan; Rangamani, Akshay; Tran, Trac; Zhang, Jack; Etienne-Cummings, Ralph (January 2019, Signal, Image and Video Processing)

The pixel-wise code exposure (PCE) camera is a compressive sensing camera that has several advantages, such as low power consumption and high compression ratio. Moreover, one notable advantage is the capability to control individual pixel exposure time. Conventional approaches of using PCE cameras involve a time-consuming and lossy process to reconstruct the original frames and then use those frames for target tracking and classification. Otherwise, conventional approaches will fail if compressive measurements are used. In this paper, we present a deep learning approach that directly performs target tracking and classification in the compressive measurement domain without any frame reconstruction. Our approach has two parts: tracking and classification. Tracking is done via detection using You Only Look Once (YOLO), and classification is achieved using a residual network (ResNet). Extensive simulations using short-wave infrared (SWIR) videos demonstrated the efficacy of our proposed approach.
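A toy simulation can convey how per-pixel exposure control condenses several frames into one coded frame. It assumes each pixel integrates light over one contiguous window of frames with a randomly chosen start; the window scheme, function name, and parameters are assumptions for this sketch and may differ from the actual camera.

```python
import numpy as np

def pce_measure(video, exposure_len, seed=0):
    """Condense a (T, H, W) video into one (H, W) coded frame.

    Hypothetical model: every pixel is exposed for `exposure_len` consecutive
    frames, starting at a frame chosen independently for each pixel.
    """
    rng = np.random.default_rng(seed)
    T, H, W = video.shape
    starts = rng.integers(0, T - exposure_len + 1, size=(H, W))
    t = np.arange(T)[:, None, None]
    # mask[t, i, j] is True when pixel (i, j) is exposed during frame t.
    mask = (t >= starts) & (t < starts + exposure_len)
    coded = (video * mask).sum(axis=0)
    return coded, mask

video = np.ones((5, 4, 4))   # 5 frames of 4x4 pixels (illustrative)
coded, mask = pce_measure(video, exposure_len=2)
```

In this model a detector such as YOLO would take `coded` as its input directly, with no attempt to recover the original `T` frames.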
Full Text Available
